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Abstract 

Sensitivity analysis is popular in dealing with missing data problems partic¬ 
ularly for non-ignorable missingness. It analyses how sensitively the conclusions 
may depend on assumptions about missing data e.g. missing data mechanism 
(MDM). We called models under certain assumptions sensitivity models. To 
make sensitivity analysis useful in practice we need to define some simple and 
interpretable statistical quantities to assess the sensitivity models. However, the 
assessment is difficult when the missing data mechanism is missing not at ran¬ 
dom (MNAR). We propose a novel approach in this paper on attempting to 
investigate those assumptions based on the nearest-neighbour (KNN) distances 
of simulated datasets from various MNAR models. The method is generic and 
it has been applied successfully to several specific models in this paper including 
meta-analysis model with publication bias, analysis of incomplete longitudinal 
data and regression analysis with non-ignorable missing covariates. 
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1 Introduction 


Model uncertainty and incomplete data are common issues in statistical analysis. An 
assessment of uncertainty due to incomplete data or model misspecification 


tracted many researchers’ a 


Vernuri et al.. 

1969; 

Draper. 

1995; 


Mention for sever al de cades (see e.g 


Copas and Li, 


isspecikcation 

Las at- 

Cornfield et al.. 

1959; 


1993). Let 9 be the parameter of 


interest and D a set of observations. Conventional inference employs a model f(D ; 6) 
and this usually provides a consistent estimate for 9. However, if part of the data are 
unobserved, inference based on the observed data may result in bias. In this case bias 
analysis can be carried out by using a so-called bias model f( D ; 9, p ) associated with 


different missing data mechanisms (MDM) (see e.g. 


Greenland, 


2005|), where rj is called 


the bias parameter. It is usually difficult to estimate r) due to the lack of data or prior 
knowledge. For example, if the missin gness is under so-called non-ignorable (or miss¬ 
ing not at random) MDM (Little and Ru bin. 2002) and 7] is the parameter involved 
in a MDM model, then ij cannot be estimated as it depends on the unobserved data. 
In some cases it is possible to resort to follow-up studies in order to estimate rj (see 


e.g. 


Kim and Yu, 


20ll|); in many instances, however, such investigations are inherently 


difficult to conduct due to the nature of the study as it occurs, for example, in epi¬ 
demiological studies. Also extra bias may exist due to the lack of randomization and 
the independence problem between former observations and follow-up samples. 

Sensitivity analysis is one of the commonly used approaches for assessing uncer¬ 
tainty via a bias parameter 7] or the related bias model ; hence r] is also known as the 
sensitivity parameter and its associated model the sensitivity model. As such, we will 


refer to p the sensitivity parameter in the remainder o: 


the paper. The use of these mod- 


(2000, 200 

) to explore publication 

bias in meta-analysis using the Heckman model 

(Hechman, 

1979) 

Molcnberg 

'hs et a 

• (5 

2001) to investigate incomplete contingency ta- 

bles and C 

opas and Eeuchi 

(2001, 

200 

5) to look at local sensitivity analysis. Those 


discussions characterize the sensitivity analysis in different ways, but their aims are 
essentially the same: to examine the influence of an individual point on model-based 
inference. A different approach considers all possible sources of uncertainty by defin¬ 
ing a prior density coupled with a Monte Carlo sensitivity analysis to sample l bias 


2 



































































parameters r and t hen inverts the bias model to provide a distribution of bias-corrected 

p.269). However it is usually difficult to choose and justify 


estimates’’ ([Greenland! . 


2005 


a prior density. 


Copas (20.13) pointed out that ‘a sensitivity analysis is essentially a warning of how 


sensitively the conclusions of a meta-analysis may depend on key assumptions about 
the study selection process’. Using the idea in a general sensitivity model f(D: 6, r/), we 
need to analyse how the conclusions change when 77 changes. It is usually more useful 
for practitioner if we can find some simple and interpretable statistical quantitie s to 


assess the sensitivity models instead of using r) directly. For example 


Copas and Shi 


( 2001 ) used the P-value for the goodness of fit to the funnel plot and the estimated 
number of unpublished studies. In spite of being very useful, those methods arc limited 
to special cases and it is not straightforward to extend the idea to other problems. 
In this paper we propose a generic method based on the nearest-neighbour distance 
between the observed data and the data simulated from different sensitivity models. 
We attempt to achieve two aims: (i) exclude the models which are ‘far away’ from the 
true model. We do so by comparing the KNN distances with a critical value based on 
a permutation test; (ii) in the set of plausible sensitivity models resulting from (i), find 
the most plausible model or a set of the most plausible models in terms of minimal 
distance. This will result in a set of most plausible estimates in a small range for the 
parameter of interest 6 , which would provide useful information to help practitioners to 
draw conclusions. We will refer to this method as simulation-based sensitivity analysis 
(SSA). 

The remainder of the paper is organized as follows. Section 2 will first describe the 
idea of simulation-based sensitivity analysis, followed by a detailed discussion on how 
to evaluate sensitivity models using the nearest-neighbour distance and a permutation 
test. The method is applied to a missing data problem in meta-analysis and a longi¬ 
tudinal study in Section 3; both examples are illustrated with numerical results from 
real data problems. Finally, a discussion is given in Section 4 with further examples 
presented in Appendix. 
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2 Simulation-based sensitivity analysis 


2.1 Sensitivity models and sensitivity parameters 

Let T be the population of complete data from which we wish to infer the parameter 
of interest 9 using model /(J 7 ; 6). If a sample D is drawn randomly from T, 6 calcu¬ 
lated from the model f(D]9) is usually unbiased. However, observed data, denoted 
by D 0 b s , are often not a representative of the complete data. Conventional inference 
usually employs a model f{D obs ; 6) under assumptions of missing at random (MAR) 
and this may result in bias since those assumptions are often invalid under some ‘im¬ 
perfect’ situations such as publication bias in met a-analysis, measurement error with 
non-ignorable missingness or the use of a misspecihed MDM model. The effect of bias 
source may be modeled with a bias model parametrized with a bias parameter. 

Let D = ( D obs , D m is ) be a set of complete data including both observed and unob¬ 
served data and R a missingness indicator vector which takes 1 if data is observed or 
0 otherwise. The complete data model can be factorized into an extrapolation model 
and an observed data model as follows. 


f(D,R\e) = f(D mis \D ob s,R, e m )f(D obs ,R\e o ). 


( 1 ) 


Here, 0 m and 0 O denote parameters indexing the models for missing data and observed 
data respectively. The item f(D mis \D obs , R, Q m ) is called extrapolation model since the 
missing data are often outside the range of the observed data. Usually the parameter of 
interest 6 is a subset of 0 O and the remaining components are the nuisance parameters. 
To simplify notations we will not distinguish @o from 9 and will always use the latter. 
The extrapolation model cannot usually be identified unless extra assumptions ar e 


made. The bias or sensitivity parameters ({Dan iels and Hogan, 


2008 : 


Greenland . 2005) , 


denoted by r), are a function of the parameter 0 m . These parameters are usually 
inestimable. But when their values are given, the full data model f(D, R\6 , © m ) in ((TJ) 
is identified. In another words, the estimate of 9 depends on the value of r/, and it may 
be biased if the value of p is given wrongly. We assume one value exists, denoted by 
?/o, for which the model f(D, R\9, @ m ) would provide an unbiased estimate of 9 given 
D obs and r/ 0 . 
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We now illustrate the meaning of sensitivity model and sensitivity parameters using 
an example whic h asse sses the relationship between passive smoking and lung cancer 
(Hackshaw et ah, 1997). A total of 37 published epidemiological studies are selected 
to investigate the risk of lung cancer in female non-smokers whose spouses/partners 
did or did not smoke. Suppose yi is the estimated log odds ratio reported in the ith 
study and ■s l is the corresponding standard error for % — 1 ,..., n. A random-effects 
meta-analysis model is given by 


Vi = Hi + <?&, Hi ~ N(n, r 2 ), ^ ~ N(Q, 1), i = 1,..., n. 


( 2 ) 


Here p is the overall mean effect which is the parameter of interest, r 2 is the variance 
measuring heterogeneity while of is the within-study variance and is usually replaced 
by the sample variance s 2 . Figure |Tj presents the funnel plot of the log odds ratios 
showing the sign of publication bias, i.e. smaller studies give more positive results 
than the larger studies. In other words smaller studies with inconclusive results are 
less likely to be selected; see the detailed discussion in Copas and Shi (j2Q00! ). 


Without assuming publication bias, the maximum likelihood estimate of the overall 
mean is p — 0.22. The relative risk is therefore 1.246 or the excess risk is 24.6%, which 
implies that people who live with smokers have a 24.6% excess risk to have lung cancer 
when compared with those living with non-sm okers. This is however an over-estimate 
due to publication bias as argued by Copas and Sh i (120001 ). To address the problem 
they used the following selection model 


Zi = a + b/si + S h Si ~ 1V(0,1), corr(ej, Si) = p. 


(3) 


A study is selected when Zi > 0 (i.e. Ri = 1). In this model rj = ( a,b ) are the 
sensitivity parameters. If rj = (a,b) is given, P(zi > 0 \yi) can be calculated from 
the selection model whereas the parameter of interest /j and the nuisance parameters 
(V 2 , p] ca n be estimated from the meta-analysis model (j2J) and the selection model (j2J) 
( Copas and Shi . 2000) by maximizing the following log-likelihood. 


l{p, t 2 , p) = Y^l l °sf(yi\zi> o)] 

2=1 

n 

= + log P(zi > 0| Vi) - log P(zi > 0)]. 


(4) 


2=1 


5 















Log odds ratio 


Figure 1: Passive smoking and lung cancer data: log odds ratios and their 95% 
confidence interval versus the inverse of the sample standard errors for 37 stud¬ 
ies; the dashed line represents jx = 0.22, the estimate without assuming selectivity; 
the solid line represents fitted values obtained from the most plausible model with 
(a, b, p) = (-2.6,0.8,0.165). 


However, since the number of unselected studies is unknown, r/ = (a, b ) is inestimable. 
In this example, uncertainty exists for the sensitivity parameters (a, b ) as well as the 
assumed selection model (l3lh A sensitivity analysis investigates how sensitively the 
estimate of 6 depends both on the choice of the selection model and the choice of the 
sensitivity parameter rj. 

The sensitivity parameter r] in this example has no clear physical meaning. It 
is difficult to draw any meaningful conclusion if it is used directly. To make sensi¬ 


tivity analysis usefu 


we need to develop some simple and interpretable quantities. 


Copas and Shi (2000) used the p-value of the goodness-of-fit to funnel plots and 


Copas 


(12013 1 used the overall selection probability. Those quantities have clear physical inter¬ 
pretation from which meaningful conclusions can be drawn; this idea, however, is not 
easily expandable to other missing data problems. In this paper, we propose a more 
generic method which works by comparing the observations with datasets simulated 
from candidates of the sensitivity model f(D]6,rj). The nearest-neighbour distance 
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is used to evaluate those models. In principle, this method can be applied to any 
missing data problem; we provide further details about the procedure in the next two 
subsections. 


2.2 Evaluating sensitivity models 

As we have discussed previously, f(D,R-,9,r]) is the sensitivity model, where D = 
(D 0 b s , D m i S ), 9 is the parameter of interest and rj is the sensitivity parameter. There 
is uncertainty for 77 , but 9 can be estimated when 77 is given. For example, we may 
estimate 9 through a marginal density 



The meta-analysis example discussed in the previous subsection follows this approach, 
and 9 is estimated from the marginal likelihood (J4|) for the observed data when 77 = (a, b) 
is given. 

Let 770 be the true value. The observed data D obs are therefore generated from 
either the full model f(D, R-, 9,7] 0 ) or its corresponding marginal model f(D obs , R\ 9, 770 ). 
Hence, if the value 770 is known, an unbiased estimate for 9 can be determined from 
the model and the observed data D obs ; see the discussion on the related topics in ?. 
Unfortunately 770 is usually unknown and it cannot be estimated from D obs under non- 
ignorable missingness. Our idea considers a set of sensitivity parameters denoted by 
T and investigate each 77 in T by comparing f(D,R-,9,rj ) with f(D,R;9, 770 ). If they 
are very close to each other, the candidate model would result in a (nearly) unbiased 
estimate under certain regularity conditions. Since 770 is unknown, the comparison 
cannot be made directly; an indirect approach simulates datasets from the candidate 
model and proceeds further by comparing the simulated and the observed data (bearing 
in mind that the observed data D obs come from f(D , R] 9, rj 0 )). 

The steps of this simulation-based sensitivity analysis (SSA) procedure are as fol¬ 
lows. 

(i) Select one 77 in T or generate it from a prior distribution 77 ( 77 ) if we have some 
prior knowledge about 77 ; 

(ii) Estimate 9 r/ using the sensitivity model f(D,R-,9,r /) given the selected 77 ; 
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(iii) Simulate an incomplete sample D° bs from the model f(D,R;9 v ,t 7 ); 

(iv) Calculate distance s(D ° bs , D obs )] 

(v) Repeat Steps (i) to (iv) for each candidate 77 in T. 


We have demonstrated how to calculate 9 V from f(D, i?; 9, 77 ) given 77 in Step (ii) 
using the meta-analysis and publication bias example in the previous subsection. More 
examples will be discussed later in this paper. 

In Step (iii) we first sample a complete dataset D v from /(.D, R; 6 ^, 77 ) and then 
remove some elements from D ri in order to generate the incomplete observations D ° bs ; 
this latter dataset is comparable to the actual observations, D obs ■ This process requires 
a missing data mechanism to be specified by 77 . As there is no an unified approach to 
simulate D 71 or D° bs , we discuss some viable techniques in Section [21 and Appendix A.3. 

Step (iv) calculates the distance between the simulated dataset D° bs and the ac¬ 
tual observations D obs . The key here is defining a proper measure in order to measure 
the ‘closeness’ or ‘similarity’ between the sets. This is particularly important for the 
large-dimensional case. To measure the similarity or dissimilarity between two clusters 
various statistical distances have been investigated in the literature. A common ap¬ 


proach calculates the distance between each pair of data 


p 01 m 


then use either 
plete linkage by 


he minimum (s ingle linkage by ISneathl (Il957f)h the maximum (com - 


s in D° bs and D obs , and 


Sorensen 

(1948 

)) or the average distance 

Sokal and Michcner, 

1958) 


In our experience, the first two distances do not work very well in SSA while the av¬ 
erage distance works well for some examples but the performance is not consistent. 
As an alternative, we use the K nearest-neighbour (KNN) method, first introduced 
by Fix and Hodges (1951) as a nonparametric measure; further details are given in 
Appendix A.l. This measure works well in most of the cases we have tried. 

We will discuss how to use the distance s(D° bs , D ohs ) to conduct a sensitivity analysis 
in the next subsection. 

Remark 1. The sensitivity model, either f(D, R\9, 77 ) or f(D obs , R\9, 77 ), depends on 
the sensitivity parameter 77 as well as the hierarchical structure (see equation ([!])); 
so does D° bs . Hence, the distance s(D ° bs , D obs ) can be used to investigate both the 
plausibility of 77 and the model structure. This method is therefore appropriate for the 























study of misspecified models and other related problems. 

Remark 2. Numerical problems may arise in step (iv) when calculating the distance 
between D a b s and one set of D° bs . A strategy that works in those instances makes use 
of an average distance by sampling more than one set from D° bs for a given 77 . We will 
be referring to the number of replicates taken to build the average distance as a Monte 
Carlo sample size or simply MC size. 

Remark 3. KNN distance is just a way to measure similarity between D 0 b s and 
D“ bs . It performs well for the examples discussed in this paper. However, some other 
distances or measures may be used for different types of data or problems. 


2.3 Sensitivity analysis 


The distance s(D obs , D° bs ) measures the ‘closeness’ between the sensitivity model and 
the ‘true’ model. Those models giving rise to large values of this metric can simply be 


discarded as they are very unlikely. In order to define a test criterion more forma 
borrow the idea of an ‘achieved significance level’ (ASL) introduced by (IFisherl . 


ly, we 


1971) 


using permutation tests. Let now D\ and D 2 denote a pair of permutation samples 
drawn from the combined dataset D* = (D 0 b Sl D° bs ) whose distance is s* = s(Di,D 2 ). 
We define the ASL for SSA as: 


ASL V = Pr Ho {s* > s(D obs , D ° bs )}, ( 6 ) 

where H 0 is the null hypothesis stating that D 0 b s and D° bs come from the same distri¬ 
bution. When we need to resort to average distances to tackle numerical instabilities 
as per Remark 2, permutation samples are generated over the multiple D obs with a 
given MC size. The ASL is the proportion of the pairs of the permutation samples for 
which the distances are larger than the distance between D 0 b s and D° bs . Thus ASL 
works similarly to a p-value with larger values taken as being in favour of Hq. 

Other approach to work out the ASL uses the concept of ‘internal distance’; in that 
case, the permutation sample (D 1 , D 2 ) is drawn from the D° bs instead of the combined 
dataset D*. 

For a given significance level (e.g. a = 5%), a plausible set of sensitivity parameters 
is defined as T a = {r] \ ASL V > a}. Models not included in r a should be avoided. 
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For all the models within r Q , the distance calculated for each p can be used as a 
further guide in deriving an estimate for 9. For instance, we define the ‘most plausible 
value’ of p that gives rise to the smallest distance and term the corresponding estimate 
of 9 as the ‘most plausible estimate’. It is important to emphasize, however, that this 
estimate may not be consistent regardless of the sample size. Hence, caution should 
be exercised even when the most plausible model has been chosen; we provide further 
discussion about this topic in the examples. 


3 Examples 

We will discuss two missing data problems in this section, including publication bias 
problem in meta-analysis and incomplete longitudinal data. Techniques on how to gen¬ 
erate D v from f(D: 9, rj) and how to sample D° bs from D ri will also be discussed. More 
examples will be given in Appendix, including mean estimation with non-ignorable 
missing data and non-ignorable missing covariates. 


3.1 Publication bias in meta-analysis 


We carry out the meta-analysis for the example of passive smoking and lung cancer 
discussed in Section [All In this example, p is the parameter of interest, ( r 2 ,p ) are the 
nuisance parameters and p = (a, b) are sensitivity parameters which are inestimable. 
When r) = (a, b ) are known, the marginal likelihood for the observed data is given by 

(HD, he. 


L(p,r,p\a,b) = [log f(yi\zi> 0, Si)} 


2—1 

n r 


E 

2—1 


1 ! ( 2 , 2\ (Vi 9 1 ) 2 

2 l0g(T + 


log <f>(a H-) + log<F(iy) 

*->2 


where 


o + ~ + Pi! 


Vi-S 


Si ^ ^ (r 2 +«J) 1/2 „ 

Vi = --T7TTTTW-, Pi = 






( 1 -PD 1/2 (r 2 + erf ) 1 / 2 

and $(•) is the distribution function of the standard normal distribution. Parameters 
(p, r 2 ,p) can be estimated by maximizing the above marginal likelihood given the 
observed data. 


10 










A sensitivity analysis is conducted to investigate how the estimates of p change for 
different values of rj = (a, b). The selection of possible values of a and b is based on a 
range_of selection probabilities for the largest and the smallest studies as suggested in 


Copas and Shi (2001) • This region is a G (—3, 0) and b e (0.1, 2) as shown in Figure [2] 


(a), from which a grid of values can be chosen accordingly to conduct the sensitivity 
analysis. The area towards the bottom left indicates a high potential for publication 
bias; this reduces as we move towards the top right of the figure. For each selected 
pair of (a, b), p is estimated using the maximum likelihood method. A contour plot is 
presented in Figure [2] (a). At the top right p = 0.22 which agrees with the estimate 
from a model without assuming publication bias but this value falls as the sensitivity 
parameters move towards the bottom left cor ner. A question which ari ses naturally is: 


which estimate should we use? In this regard. ICopas and Shil (120001) and lCopas and Shi 


(120011 ) have suggested several statistical quantities which are interpretablc and from 
which some answers can be found to the aforementioned question. 

We now use the SSA method to conduct a sensitivity analysis. For each pair of 
(a, b) we calculate the estimates (p, r 2 , p ) and simulate a complete dataset using meta¬ 
analysis model ([2]); denoted the simulated data by D = {(yj, Sj),j = 1,. .., N}. Given 
the known values of (a, b) and the estimates of the parameters, we subsequently draw 
random numbers of z from the selection model (J3]). Note that z is correlated with y 
and the study is not selected if z < 0. The selected studies form the observed data 
D°^ s b y This dataset is compared with the actual dataset D a b s by using KNN distance 
,D 0 b s ). In accordance with Remark 2, Section 2.2, we have used the average 
distance with 1000 replicates for each pair of (a, b ) in order to remove any numerical 
instability (i.e. the MC size is 1000). The contours of the average distances are shown 
in Figure [2] (b). 

Using the permutation method, discussed in Section 3.3, we can calculate the ASL 
for each value of ?/ and discard all pairs of sensitivity parameters whose ASL values are 
smaller than 0.05. The remaining pairs form a plausible set of sensitivity parameters 
which results in estimates for p ranging from 0.043 to 0.197. The upper bound for 
this estimate indicates an excess risk of 21.7%. The conventional method (without 
assuming publication bias) results in an excess risk of 24.6%, thus overestimating the 
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(a) fi (b) average distance 


Figure 2: Contours plots for passive smoking and lung cancer data. In (b), the area in 
red corresponds to the most plausible values for r] with the average distances being all 
very close to the smallest distance of 0.295. 


SSA upper bound by 13%. 

The most plausible value (smallest distance) is achieved when (a, b ) = (—2.6, 0.8) 
which leads to the estimate of fi = 0.165 or an excess risk of 17.9%. The fitted line is 
shown in Figure 1, showing a much better fit than the dashed line which is obtained 
from the model without considering publication bias. 

As we have mentioned before hand, caution is needed when working with the most 
plausible value. This can be seen by looking at Figure 2(b), where the average KNN 
distances are all very close to the shortest distance of 0.295 for several values of the 
sensitivity parameters. The estimates of /i for those sets range between 0.086 and 0.183 
which lead to excess risks in the interval between 9% and 20%. 

Based on the 37 studies collected in the meta-analysis and the simulation-based 
sensitivity analysis, we can conclude that the excess risk is not likely to be higher than 


21.7% or smaller than 4.4%. Most likely it will take va 


This result is consistent with the findings inlCo pas and Shi (2000). The latter is based 


ues between 9% and 20%. 


on goodness-of-fit test but the method is not easily extendable to other missing data 
problems. The SSA method proposed in this paper is more general and can be used to 
deal with different types of problems. 
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3.2 Incomplete Longitudinal Example 

We next consider a follow-up randomized study with M scheduled repeated measures. 
Let Y m be the outcome measured at visit m; we use the notation Y m = (Y \,..., Y m ) 
to denote the history of the outcomes up to visit m and W+ = (Y m+1 , ..., Ym) as the 
future outcomes after visit m. Linder monotone missingness, if the outcome at visit 
m is missing then all outcome after that visit are all missing. We denote R k as the 
indicator that Y k is observed, which equals to 1 if Y k is observed and 0 otherwise. We 
assume that Y\ is observed on all individuals = 1) and let d be the last visit on 
which a subject has a measure; it follows that the observed data for that subject is 
Yj = (Yi,...,Yd), where d < M. We are interested in the inference of the mean 
// = E(Ym)i the intended outcome at the final scheduled visit. 

The probability of dropping out at visit k can be modeled by the following logistic 
model: 

logit{1 - P(d = k\d >k,Y~, W+)} = h{Y-) + r,Y k+l 

where h(-) is an unknown function. Either parametric or nonparametric approaches 
can be used. A special case is the following linear model: h{Y) 7 ) = + hiLc- This 

model describes the probability of dropping out at visit k + 1 by a logistic linear model 
which depends on the latest recorded data Y k and the current unobserved data Y k+ \. 
If 77 = 0 this is a MAR model. One would make this choice if there is evidence or 
belief that the missingness associated to Y k +\ can be entirely encoded by the recorded 
history observations Y]7. This is certainly a strong assumption in longitudinal study 
and usually very difficult to justify. We now use the SSA method proposed in this 
paper to investigate the model under a MNAR mechanism, and try to ford a plausible 
value of rj in the neighbourhood of 0 (where 0 corresponds to the MAR model). 

Let us first consider a simple case. Suppose that we just have two visits, i.e. M = 2. 
There is no missing data for the first visit, i.e., Y\ is observed for all the subjects. But 
part of the data are missing in the second visit. Let R be the missing indicator for Y 2 . 
A semi-parametric logistic regression model can be defined by: 

P(R = 1|W, Y 2 ) = expit (/i(W) + rjY 2 ), (7) 

where h(Y\) is a nonparametric model and expit(f) = e l /{1 + e l ). The parameter r) is 


13 



not identifiable from the observed data only since it also depends on the missing part 
of Y 2 . 

We need the following lemma to apply SSA to the above problem. 


Lemma 1 Let T be the statistical variables which are partly observed. Given the ob¬ 
served data D 0 b s , the conditional density of the missing part of T (i.e. when R = 0) 
can be expressed by 

Q(D) 


f(T\D obs ,R = 0) = f(T\D obs ,R = l) 


where 


Q(D) = 


E(Q(D)\D obs ,R=iy 

P(R=0\D) 

P(R=1\D) 


( 8 ) 


(9) 


is the conditional odds of missing data. 


The proof is given in Appendix A.2. 

Using ([7l) and the above lemma, we have 

f(Y 2 \D obs , R = 0) = f(Y 2 \D obs , R = 1): 


exp(-r/U 2 ) 


'E(exp(-r ] Y 2 )\D obs ,R=iy iJl ” 
In the above formula, the conditional density of the observed part of Y 2 , f(Y 2 \D obs , R = 
1), can be obtained parametrically or non-parametrically, for example, using a gen¬ 
eralized additive mo del ( Hastie and Tibshiranil . 1990 ) or Gaussian process regression 
( Shi and Cho i. 2013). Thus the only uncertainty in formula (IlOp is the sensitivity 
parameter rj which cannot be estimated from the observed data. 

The missing data can therefore be generated from (TTOli for each given g. This forms 
a complete dataset for Y 2 . We can further select a subset using the selection model 
(jTj) which we denote as Yf h f. And, finally, this subset can be compared with the real 
observed data Yf bs by using the KNN distance. Thus we can conduct a sensitivity 
analysis for g. The details of this procedure are worked out in the next subsection with 
a real example. 

Let us now move to discuss the general case with M visits. Assume that the data 
have been recorded for all the subjects up to the k -th visit but with some missing data 
afterwards. Let Rk+i be the missing data indicator for the (k + l)-th visit. We use a 
logistic regression model similar to ([7]) as the MDM model: 


P(Rk +i = 1| Y k , Y k+ 1 ) = expit(h(Y k ) + rj k+1 Y k+l ). 


( 11 ) 
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Using Lemma 1, we have 


f(Y k+1 \D obs , 

Rk+l ~ 0) = f{Y k+1 \D obs , Rk+l — 1 ) 


exp(— rjk+iYk+i) 


E(exp(-r) k+1 Y k+1 )\D obs ,R = 1)' 


( 12 ) 


We can generate the missing part of Y k+i using the above density function for any 
given ij k+ 1 and then use those generated data to fill in the missing observations in 
Y k+ i, we denote the data up to the (k + l)-th visit as Y)~ +1 ^ Note that the imputed 
data for missing part of Y k+1 depends on the sensitivity parameter, so the notation is 
sub-indexed by i] k +i- 

The procedure goes on to generate the missing data for the (k + 2)-th visit in a 
similar way using formulas (HI]) and slightly adapted by 


P(Rk+ 2 = l\Y k+hVk+i ,Y k+2 ) = expit (h{Y k+lrik+i )+rj k+2 Y k+2 ), 


and 


exp(-?/ fc+2 Yfc+ 2 ) 


f (Y k +2\D 0 bs,n k+1 , Rk+2 0 ) f (Y k +2\D 0 bs,r] k+1 , 


Rk+2 ^ E(exp(—r) k+ 2Y k+ 2)\D 0 b S!Vk+1 , R = 1) ’ 


where D 0 b s ^ k+1 includes all the observed data and the inputed data for missing part in 
the (k + l)-th visit. The above density function can be used to generate missing data 
and fill in the missing part for the (k + 2)-th visit. Keeping up with the notation, i ] k +2 
is the sensitivity parameter in this step. 

The procedure continuous until all the missing observations are filled in. In this 
case, the generated data depends on a sensitivity parameter vector 77 = 1 ,..., r] M ) T . 

For each visit starting from the (k + l)-th, we can form a subset of selected data based 
on the generated data and the selection model, and then compare it with the actual 
observations. This can be used to conduct a simulation-based sensitivity analysis for 
the parameter rj. 

Remark 4. The aforementioned method may be very time consuming if the dimension 
of 77 is large. In this case, we may focus on a few visits which have a large missing rate 
and the MDM may be non-ignorable. We can then use SSA method for those visits by 
considering a grid of 77, and assume a MAR MDM and then use multiple imputation 
for the other visits. 


15 




3.2.1 Children weight data — a bivariate outcome with dropout 


Here we will apply the simulation-based sensitivity analysis to a real example. The 
1970 British Cohort Study (BCS70) follows the lives of more than 17,000 people born 
in England, Scotland and Wales. Two sub-samp les in the first 5 years of birth were col¬ 
l ected: a 22-month subsample rtChamberlain et al.l. 1 19 70h and a 42-month subsample 


([Chamberlain! . 


1973 b We are interested in the average Children’ weight at 42-month 


from birth. The 22-month subsample has a total of 2348 individuals on which infor¬ 
mation was collected; the mean and standard deviation (SD) for those observations 
are 11.92 and 1.56kg respectively. Likewise, the 42-month subsample has 1856 obser¬ 
vations (i.e. data on 492 people are missing for unknown reasons) with a mean and 
SD of 15.04 and 1.98kg respectively. We believe that the dropping out mechanism is 
associated with the first outcome {Y\ ) and may be also associated with the current 
outcome ( Y 2 ). 

We now conduct a simulation-based sensitivity analysis as discussed in the previous 
subsection. We choose r] within the interval (-0.2, 0.2) and consider any values beyond 
this range as not very likely. This is because for example, if ij < —0.2 , then at least 
94% of the 494 unobserved Children would have weights greater than 20 kg which is 
not not a realistic outcome when considering the mean of 15.04 and standard deviation 
of 1.98 for the observed data. 

In this example, we simply use a linear regression model to obtained the density 
function of observed Y 2 given the observed data, and a linear model: h(Y\) = /3o + (5\Y\ 
in (0. Then, the parameters (/3 0 , (3 1 ) can be estimated from the observed data when 
the value of sensitivity parameter r] is given. 

The final results are shown in Figure |3j The plausible set indicates that the mean 
estimation of the 42 month weight is in the range of (14.51, 16.19) with the most 
plausible value of 15.22. The average distances are presented in Figure l3bl providing 
further evidence that the sensitivity models near the bottom (corresponding to the 
most plausible model) are much closer to the true model than the others. 

It is interesting to note that the sensitivity model with a MAR assumption (rj = 0) 
is also included in the plausible set as shown in Figure [31 Although the MAR model is 
not the model having the shortest distance to the true model, there is no evidence to 
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reject it. Or in other words, the non-ignorability of the missingness is not severe. This 
is a by-product of using the SSA method. 



- 0.2 - 0.1 0.0 0.1 0.2 


(a) fi 



H-1-1-1-1-r 

13 14 15 16 17 18 


(b) average distance 


Figure 3: Children weight data: estimation of mean weight ji and the average KNN 
distance for a set of sensitivity parameter; ‘MAR’ stands for the model with MAR 
assumption. 


4 Discussion 

Missing data can usually be modelled by a sensitivity model f(D, R] 9 , 77 ), but the esti¬ 
mate of 9 , the parameter of interest, is often biased unless the true value of sensitivity 
parameter 77 is given. Unfortunately 77 is often inestimable based on the observed data 
D 0 b s particularly for non-ignorable missing data. A sensitivity analysis investigates 
how sensitively the estimate of 9 depends on the choice of 77 . To get some meaningful 
results, we need to use some statistical quantities which have clear physical interpre¬ 
tation. The simulation-based sensitivity analysis method we proposed is based on the 
KNN distance, ft can be applied in a general fashion and can help in gaining new and 
further insights into the problem at hand. The method can be applied to a great variety 
of missing data and model misspecihcation problems as we have shown in four differ¬ 
ent examples throughout this manuscript. The numerical studies show that meaningful 
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results can be obtained for each of those different problems. 

One of the key steps when conducting a simulation-based sensitivity analysis is to 
be able to simulate missing data when rj is given. Although this may be straightforward 
for some problems such as the meta-analysis with publication bias discussed in Section 
3.1, in many other cases it will not be a trivial step; an example of this was shown in 
Section 3.2 which required of a somewhat elaborate process to tackle the generation of 
the missing observations. Other models can be treated similarly although a different 
array of techniques may be needed as shown in the extended examples provided in 
Appendix. 

Whereas the most plausible value of // will provide a final result, in certain occa¬ 
sions it may be more advisable to work with a range of solutions. As we discussed 
for the meta-analysis example in Section 3.1, there may be a set of rj for which the 
corresponding KNN distances are all very close to the shortest distance. In those cases, 
we recommend using a set of estimates within a small range rather than a single value. 
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Appendix 


A.l K nearest-neighbour (KNN) distance 


Two observations x r a nd x 3 are defined to be neighbours if (see Definition 1, page 364 


Wong and Lane, 


1 9831 ): 


d(xi, Xj) < d k (xi) or d k (xj), 

where d is the Euclidean metric and d k (xi) is the kth nearest-neighbour distance to 
point Xi . Prior to defining the distance between two clusters using the KNN method, 
the distance between an individual point and a cluster needs to be formalized first. 
Given a cluster D = {x i: i = 1, an individual observation x is said to be 

neighbour of cluster D if there exists at least one point Xi in cluster D that 


d(xi,x) < d k (xi). 


where d is the Euclidean metric and d k (xj) is the distance of fcth nearest-neighbour 
within the cluster to the point ay. 

Now we can define the similarity and the distance between two clusters. Let D* = 
{x*,j = 1,... ,m} be another set; we expect that most of observations in D* are in 
the nearest-neighbour of D if the two clusters are quite similar or in ’close proximity’. 
A measure of how close both cluster are is given by the proportion of observations in 
D*. This is formally defined by the average E(/i) where its j-th element has the form: 



b E x^oUidi^x*) < d k (xi)} > 0; 

0, otherwise. 


for j 


1 


(13) 


Here / is indicator function which takes 1 if the condition is satisfied and 0 otherwise, 
and thus ^2 Xi&D {I(d(xi, x*) < d k (xi)} takes an integer in the set {0,1,2, ..n}. Only 
when it equals to 0, the observation x 3 is not the nearest-neighbour of cluster D. 
Similarly, we can define the proportion E(/ 2 ) of the points in D with 


T (i) _ l b < d k (x*)} > 0; 

1 2 — \ J tor i — i,. 


0, otherwise. 

The average of E{I\) and E(I 2 ) 


( 14 ) 


8(D,D*) = -(E(I 1 ) + E(1 2 )) 
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is considered as a similarity measure between D and D* and (1 — s(D, D*)) is used as 
the ‘KNN distance’. 

Other measures such as Mahalanobis metric may also be used. 


A.2 Proof of Lemma [T| 


Using Bayes Theorem, we have 


f(T\D oba ,R = Q) = 


P(R = 0\D)f(T\D obs ) 


P(R = 0\D, 


obs) 


Similarly, we have 


f(T\D obs ,R=l) = 


P(R=l\D)f(T\D obs ) 


P(R=1\D, 


obs) 


This leads to the following equation 


f(T\D obs ,R = 0) = f(T\D obs ,R=l) 


P(R = 0\D)P(R = l\D obs ) 


P(R=l\D)P{R = 0\D oba )' 

Let Q(D) be the one defined in fl9J), then we have 

P(R = 0| D obs , T ) 


E{Q{D)\D obsi R=l) = 


P{R=l\D obs ,T) 
f(T\D 0 ,R = l) 
P(R=l\D obs ,T) 
f{T\D obs ' 


f(T\D 0 ,R=l)dT 
P(R = 0| D obs ,T)dT 


P{R — 1 \P > obs) 


P(R — 01 D obs , T)cLT 


f(T\D obs , R = 0) ^ °^° bs) dT 


P(R= 1| D 


obs ) 


P(R = 0|L> O b S ) 
P(R = 1 \D obs ) 


This proves the Lemma. 


A.3 Additional examples 

A.3.1 Mean estimation with non-ignorable missing data 

Assume that a continuous variable X has mean /x and variance cr 2 . The population 
mean /x is of interest. The complete dataset is D — (X obs , X mis ) where the MDM 
depends on the missing value, and thus it is a non-ignorable missing data problem. 
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Let R be missing data indicator and further assume that the MDM is modeled by a 
logistic model 

P(R = l\X — x) — expit{7/(x + A)} (15) 

where expit (x) = exp(x)/(l+exp(x)) and A is assumed to be known (it can be estimated 
if the proportion of the missing data is known). The choice of r] = 0 specifies that R 
and X are independent (i.e., MAR). 

The parameter of interest is /i = E(X), the mean of the complete data, can be 
expressed by 


li = E(X\R = 1 )P(R = 1) + E(X\R = 0 )P(R = 0) = tt/p + (1 - 7t)// 2 , 


where 7r = P(R = 1) is the proportion of the observed data; /xi and /i 2 are the means 
of the observed and the missing data respectively. So the evaluation of /i 2 is the main 
task. Using Bayes theorem, we have 


f(x\R = 0) 


P(R = 0|x)/(x) 

P(R = 0) 

P(R = 1) P(R = 0\x) 


f(x\R = 1) 


P(R = 0)P{R= l|x) 


(16) 


Denote that tt x = P(R = 0|x), then 

P(R = 0|x) 1 — TT X 1 

P(R—l\x) 7 t x exp(x/(x + A)) 

is the odds of missing when X = x. The second equation comes from MDM model 
(|T5]h The mean of the missing data can therefore be expressed by 


fi 2 = j xf(x\R = 0 )dx 

= f xf(x\R — 1 ) t —- ( ] v A x 

J 1 - 'Kexp{r)[x + A)) 

* P r £ 1 

1 — 7r ' Y|R - 1 [exp(?/(X + A))_ ' 

In this example, rj is the sensitivity parameter. It is clear that we are unable to estimate 
this parameter from the observed data alone as it depends on the missing data as well. 

Equation (fT6j) is the key to conduct a simulation-based sensitivity analysis as it is 
needed in step (iii) of the SSA procedure to simulate data. 

We now conduct a simulation study to demonstrate the use of the SSA approach in 
order to address the non-ignorable missing data problem. The data is generated from 
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a normal distribution X ~ iV(//, cr 2 ) with the true values selected as [i = 0, a 2 = 1 
and rj = — 1 , A = 0 in model f|15p . indicating an average missing proportion of about 
50%. Sample size of the complete data is 100. In this example A is assumed to be fixed 
and rj is treated as a sensitivity parameter. The SSA approach is designed as follows. 
We first select a series of r) chosen from the interval of (-5,5). For each selected r), we 
evaluate the density f(x\R = 0) by (fIBl) and then use the density function to sample 
the missing data, denoting the imputed values as x m i S)ri . Thus, D n = ( x 0 b s , x m i Sjri ) forms 
a simulated complete dataset. To compare the simulated dataset with the observed 
data, we further generate a set of x° bs from D v using MDM (I15p with the given value 
of rj. The simulated set of x° bs is comparable with the observed dataset x 0 b s ■ Finally, 
the closeness of x° bs and x 0 b s is evaluated using the KNN distance. 

Figure |4] shows the results with the MC size of 1000. The KNN distance takes the 
minimum at rj = —0.89 when K = 2. The corresponding estimate is ji = —0.064 which 
is quite close to the true value. We also used the permutation test to find the plausible 
set of the sensitivity parameter. The dashed line in Figure [4] indicates the critical value 
at a significance level of 5%; all the ones below that line forms the set of plausible 
values resulting in estimates of /j in the range of (-0.82, 0.91) when K=2. We have 
also repeated the simulation for other values of K. As shown in the same figure, all of 
them give similar results although the values of the KNN distance is less sensitive to 


Hall et al. 

(2008 

) and 

Nigsch et al. 

(2006) 


Table Q] presents the simulation study results for 500 replications with different 
MC sizes and different values of K. The value of rj is selected by the shortest KNN 
distance with /I being the most plausible estimate. If we just use the sample mean of 
the observed data, the average of fi is about -0.5 which is far away from the true value 
of 0. Table Q] shows that the SSA approach gives much better results. This is also 
supported by the values of coverage probability, which are also reported in Table [0 
The results also show that the estimates are quite consistent for different values of 
K even for small number of the MC sizes. Histograms of rj for a range of MC sizes are 
provided in Figure 0 As it can be seen, the method with a MC size of 100 or above 
when K = 2 usually gives quite robust results. 
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Bias parameter selection Bias parameter selection Bias parameter selection 



Figure 4: Mean estimation with non-ignorable missing data: Upper panel - KNN 
distances versus different values of 77; Lower panel - KNN distance versus the corre¬ 
sponding estimate of /i for the given value of 77. The dashed line indicates the critical 
values at 5% significance level. 

Table 1: Mean estimation with non-ignorable missing data: simulation study 


average of /t (sd) Coverage Probability (%) 


MC size 

10 

20 

100 1000 

10 

20 

100 

1000 

K=2 

-0.065 (0.16) 

-0.069(0.16) 

-0.064(0.16) -0.064(0.16) 

80.8 

78.6 

89.4 

91.6 

K=3 

-0.133(0.16) 

-0.138(0.16) 

-0.158(0.16) -0.154(0.16) 

73.4 

72.8 

78.2 

80.4 

K=4 

-0.143 (0.17) 

-0.152 (0.16) 

-0.190(0.16) -0.191(0.16) 

71.6 

67.4 

68.6 

68.8 

K=5 

-0.169(0.17) 

-0.187(0.17) 

-0.211(0.16) -0.204(0.16) 

70.6 

64.2 

62.0 

66.0 


A.3.2 Regression analysis with missing covariates 

We now consider a regression problem with missing confounders. Let D = ( D 0 b s , D mis ), 
where D 0 b s denotes the data where both the response variable and the covariates are 
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Table 2: Mean estimation with non-ignorable missing data: selection of q 


average of selected q 


MC 

size 

10 

20 

100 

1000 

K= 

=2 

-0.85 

-0.87 

-0.88 

-0.89 

K= 

=3 

-0.73 

-0.71 

-0.65 

-0.67 

K= 

=4 

-0.69 

-0.67 

-0.57 

-0.57 

K= 

=5 

-0.58 

-0.59 

-0.54 

-0.54 


MC size 10 


MC size 20 





il 


K=2 


K=2 


MC size 100 


MC size 1000 



-4 -2 0 2 4 


n 

K=2 



-4 -2 0 2 4 


il 

K=2 


Figure 5: Mean estimation with non-ignorable missing data: histograms of the selected 
q with different MC sizes. 


observed; while D mis denotes the data where the response variable is observed but the 
covariates are only partly observed. We still use R as the missing data indicator and 
assume that the MDM depends on both D 0 b s and D mis . The joint density of data D 
and the indicator R is expressed by f(R,D), which can be factorized as f(R\D)f(D). 
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We use the following semiparametric model (Ki m and Yu, 201 lj) as the selection model 
for f{R\D): 

logit{P(i? = 1| D obs , D mis )} = h(D obs ) + r]D mis , (17) 

where h(-) is a nonparametric function and ij is a sensitivity parameter. This model 
takes a nonparametric model on the observed part D obs and a simple linear form on 
the missing variable D m i S . Equation (fT7| can be rewritten as 


tt d = P(R = 1| D) = expit (h{D obs ) + rjD mis ). 


Using Lemma 1, we have 

f (D m i S \D obS i R = 0) = f(D mis \D obs , R = 1) 


exp (-J]D mis ) 


(18) 


(19) 


P(exp( tiD m i S ')\D obs , R 1) 

The sensitivity parameter ?/ determines the amount of departure from the ignorability 
of the MDM. In formula (1X9)1 . we need to compute the conditional distribution of 
missing data f{D mis \D obs ,R = 1). A consistent estimate of f(D mis \D obs , R = 1) can 
be obtained parametrically or non-parametrically; an example is shown below. 

To use the SSA approach we first choose a value of 77 from a predetermined set T. 
For each r] we simulate missing data from either ([ 8 ]) or (1T9]1 . Together, the (simulated) 
missing and the observed data form a dataset which is complete. From here, a subset 
D° bs can be re-sampled using the MDM specified in (fT7jl and the KNN distance between 


~)obs 

D° bs and the true observed data D obs can finally be calculated. 


We now conduct a simulation study for non-ignorable missing covariate problem 
based on a real data example. 

The US Federal Highway Administration published fuel consumption data over 50 
states and the District of Columbia in 2001; seeIWe isbergj (2005) (Chapter 1, page 15). 


The aim of the research is to understand the effect on fuel consumption (T) of several 
covariates namely, the number of federal-aid highway miles (Ad), the personal income 
(X 2 ), the number of licensed drivers (A 3 ) and the state gasoline tax rate (X 4 ). 

We consider the following linear regression model 

4 

t — do t 'y ^ @j x j + £■ ( 20 ) 

3 =1 

By using the complete dataset, estimates for ( 6 * 0 , 9i, 9 2 , 9 3 , 9 4 ) can be obtained and they 
are presented in the first row in Table [3]. 









We now consider a missing data problem by letting income (X 2 ) to be partly missing 
with probability P(R = 0|Z?) = 1 — expit(1 + {x\ — X\) — 0.5(x 2 — x 2 )), where R is the 
missing data indicator and x is the sample mean. This is a non-ignorable missing data 
problem. 

To use the SSA approach we resort to the semi-parametric selection model in (ITT)) , 
i.e. 


P(R — l\t, x 1 ,x 2 ,x 3 ,x 4 ) = expit (h(t,x 1 ,x 3 ,x 4 ) + rjx 2 ). (21) 


Following the discussion around ([T9]) , we simulate the missing data from: 

exp(—r/x 2 ) 


f(x 2 \t,x 1 ,x 3 ,x 4 ,R = 0) = f(x 2 \t,x 1 ,x 3 ,x 4 ,R = 1)- 


E(exp(-r)x 2 )\t, x 4 , x 3 , x 4 , R = 1) ’ 
where f(x 2 \t,x 4 ,x 3 ,x 4 ,R = 1) is modelled by the following parametric model: 

(x 2 \t,x l ,x 3 ,x 4 ,r = 1) ~ iV( 7 o + yit + 72 X 1 + 73 X 3 + 7 4 x 4 ,t 2 ), 


where the unknown parameters ( 70 , 71 ,..., 7 4 , r 2 ) can be estimated from the observed 
data and X 2 (personal income) is assumed to be normally distributed. 

The simulated data for missing x 2 and the observed data can now be used to 
form a ‘complete’ dataset D v = (t, x 4 , x 2 , x 3 , x 4 ) for each given 7 . Using D n we can 
estimate the parameter ( 6 V , cr 2 ) from the linear regression model (1201) . I 11 order to make 
the SSA method more reliable numerically, we build on the idea of ‘bootstrapping 


residuals’ introduced in (Efron and Tibshiranil . 11993 ). Instead of using D v directly, we 
use D* = (t*, Xi,X 2 ,x 3 , x 4 ), where t* is re-sampled conditionally on the following linear 
regression model with estimates (d^jCr 2 ) and imputed covariates x* = ( 27 , x\, x 3 , x 4 ) t : 


t*\x 4 , x* 2 , x 3 , x 4 ~ N(eEx*,cr 2 ), 


i.e. t* is simulated by adding residuals on the predicted values, where the residuals 
are sampled from a normal distribution A^(0,d 2 ). And, finally, we then calculate the 
distance between D°^ s * = (t*,xi,x 3 ,x 4 ) and the observed dataset D a b s = (t,x 4 ,x 3 ,x 4 ) 
by using an average distance a MC size of 100 (this will reduce the sampling error). 

I 11 this example tj is selected from within the interval (—5, 5) at equal-distance of 
0.2 unit. Figure [H] shows the KNN distances with K = 2 against the values of 7 . As 
it can be seen, the ‘most plausible’ value or minimum distance is achieved at around 
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rj = —0.6. It is also interesting to note that the corresponding estimates are very close 
to the ones obtained from the complete data. The ASL using the permutation test has 
also been calculated but its value is well above the critical value of 0.05 and hence in 
not showing in the plot. 


1 



(a) K=2 



- 1 - 1 - 1 - 1 - 1 - 

-4 -2 0 2 4 

V 

(b) K=3 

Figure 6: KNN distance for the fuel consumption data. 


The simulation study results with 100 replications are presented in Table [3j The 
row named as ‘SSA’ shows the average values for the ‘most plausible estimate’. As 
seen, all the estimates are very close to the estimates calculated from the complete 
data except 62 . This is further evidence that the most plausible model obtained via a 
SSA study needs to be chosen with caution. It may provide the ‘best’ results based on 
the observed data, but it may not be consistent. 
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As comparison, we also considered the complete case analysis, using multivariate 
imputation under MAR (‘MICE’ Packag e with Bayesian Linear regression method; 


see 


van Buuren and Groothuis-Oudshoorn ( 20111 )) and multivariate imputation under 


MNAR (iteration step for sensitivity analysis, see Resseguier (2 011 )). Note the latter 
method modifies the imputation model under a MAR hypothesis to MNAR hypoth¬ 
esis by specifying a supplementary parameter (SupPar). For binary or categorical 
variables, this parameter is the odds ratio (i.e. the ration between the odds of the 
modality interest among subjects with missing values and subjects without missing 
value). Likewise, for continuous variables, this is the difference in expected values. 


As a comparison, we 


posed by 


rave also applied Monte Carlo sensitivity analysis (MGSA) pro- 


McCandless et al. 


Green land (20051 and the Bayesian sensitivity analysis (BSA) proposed by 


020071) . The prior distribution f{rj) is selected as U(-5,5) for both 


the MCSA and the BSA methods. As it can be seen in the simulation results with 100 
replications presented in Table |3l overall the SSA method performs more favorably and 
robustly than the others. 
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Table 3: Simulation study for fuel consumption data 



0 o (se) 

k 

k 

k §4 

Complete data 

154.19(9.08) 

18.55(6.47) 

-6.14(2.19) 

0.47(0.13) -4.23(2.03) 

Complete case 

155.06(14.44) 

16.20(12.41) 

-7.32(3.37) 

0.24(0.18) -7.29(2.24) 

Multiple Imputation: 

MAR 

130.15(13.11) 

30.68(8.5) 

-9.86(4.13) 

0.50(0.15) -5.17(2.17) 

MNAR (SupPar=l) 

137.70(12.03) 

24.24(8.95) 

-9.74(3.91) 

0.43(0.14) -5.42(2.37) 

MNAR (SupPar=2) 

141.26(11.27) 

22.84(8.87) 

-9.68(3.65) 

0.42(0.14) -5.33(2.34) 

MNAR (SupPar=5) 

152.11(9.69) 

19.45(8.27) 

-8.19(2.88) 

0.42(0.14) -5.01(2.24) 

MNAR (SupPar=8) 

159.49(9.36) 

17.69(7.73) 

-6.34(2.24) 

0.43(0.14) -4.75(2.17) 

Sensitivity Analysis: 

SSA (K=2) 

152.30(8.29) 

18.24(5.59) 

-3.99(2.18) 

0.49(0.12) -4.39(1.85) 

SSA (K=3) 

154.12(8.22) 

17.01(5.57) 

-4.05(2.08) 

0.49(0.11) -4.38(1.83) 

MCSA 

153.62(13.22) 

22.87(6.94) 

-3.44(1.87) 

0.49(0.13) -4.29(2.07) 

BSA 

131.51(10.58) 

31.77(6.49) 

-10.17(2.82) 

0.45(0.12) -4.75(2.00) 
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